Predicting Protein-Protein Interactions via Random Ferns with Evolutionary Matrix Representation

Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis and genetic mechanisms, in guiding drug design, and in other biochemical processes; the identification of PPIs is therefore of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated. Researchers have designed many experimental methods to detect PPIs from these data, and PPI prediction has become a research hotspot in proteomics. However, because traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems based on machine learning have been applied to PPI prediction, improving the overall recognition rate. In this paper, we present a novel and efficient computational method that predicts protein interactions using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate, from each protein sequence, a position-specific scoring matrix (PSSM) containing protein evolutionary information. Second, we used a feature representation scheme, MatFLDA, to extract the essential information of the PSSM, and obtained five training and five testing datasets by five-fold cross-validation. Finally, the random ferns (RFs) classifier was employed to infer the interactions among proteins, yielding a model called MatFLDA_RFs. The proposed MatFLDA_RFs model achieved 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, outperforming the other computational methods it was compared with. The experimental results indicate that the proposed method yields better PPI prediction results and provides an effective tool for the detection of new PPIs and the in-depth study of proteomics. We also developed a web server for the proposed model, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.


Introduction
Recognition of protein-protein interactions (PPIs) is important for understanding various cellular biological activities [1]. Knowledge of PPIs helps to explore and elucidate protein functions, supports drug design and new drug development, and clarifies the mechanisms of biological activity and of related proteins in cells [2]. It can also provide new ideas for other studies, such as the ranking of disease genes [3], functional module identification [4], and human disease prevention and treatment. In general, research approaches for PPIs fall into two categories: computational methods and biological experimental methods. In recent decades, many experimental techniques have been used for large-scale PPI validation, such as yeast two-hybrid (Y2H) screens [5], coimmunoprecipitation (Co-IP) [6], nuclear magnetic resonance (NMR) [7], protein chips [8], and other high-throughput biological techniques. However, these methods have some unavoidable disadvantages: they are not only time-consuming and expensive but also suffer from high false-positive rates and weak generalization ability. It is therefore of great practical significance to develop effective machine learning approaches for PPI prediction that save cost and time and ultimately improve the prediction accuracy of protein interactions. To date, numerous computational approaches have been proposed to detect PPIs based on different data types, including protein domains, genomic information, evolutionary knowledge, structure information, gene fusion, and phylogenetic profiles [9][10][11][12][13][14]. Although these methods can detect PPIs, they are not universally applicable because they require prior knowledge of the protein.
Although amino acid sequence information is readily available for a large number of proteins, the 3D structural information of many proteins is still unclear, and the known and available PPIs for most species are still incomplete or very sparse. Consequently, it is particularly important to design novel computational methods for PPI prediction utilizing only protein amino acid sequence information, so as to better employ these abundant protein sequence data.
Numerous previous works have shown that using protein amino acid sequence information alone is sufficient to predict PPIs. So far, many different computational methods based on sequence information have been presented to implement this pattern in PPI prediction, such as combining average blocks with relevance vector machine [15], combining principal component analysis with ensemble extreme learning machine [16], combining conventional auto covariance with support vector machine [17], local descriptors using k-nearest neighbor [18], discrete cosine transformation using weighted sparse representation model [19], and so on. In 2017, Wang et al. [20] proposed a PCVMZM method based on protein sequence. The Zernike moments (ZM) are used as the feature extraction method. ZM can capture multiangle useful and representative information. Probabilistic classification vector machines (PCVM) are a sparse classification model that optimizes the kernel parameters by the expectation-maximization (EM) algorithm, which not only improves the prediction performance of PPIs but also reduces the computational time in the testing phase. The average prediction accuracy achieved by the PCVMZM method was 94.48% on the Yeast dataset. In the same year, Du et al. [21] proposed a method called DeepPPI from the angle of deep learning technology by using amphiphilic pseudo amino acid composition feature extraction algorithm to extract features from amino acid sequences, which opens a new way for studying PPIs. This DeepPPI method reached a prediction accuracy of 94.43% on the Saccharomyces cerevisiae dataset. In 2018, Göktepe and Kodaz [22] applied a new technique called weighted skip-sequential conjoint triads to predict PPIs. 
The method adopts principal component analysis (PCA) to remove noise, captures protein sequence information by combining Bi-gram representation and pseudo-amino acid composition, and finally uses a support vector machine (SVM) as the prediction classifier to identify interactions between proteins. In the same year, Song et al. [23] presented a novel feature fusion scheme based on a random projection ensemble method, which separately used three algorithms (fast Fourier transform, discrete cosine transform, and singular value decomposition) to explore and represent the patterns of interactions between amino acids. In 2019, Chen et al. [1] developed an end-to-end framework, called PIPR, to predict PPIs using only protein sequences. They effectively capture local significant features and sequential features from protein sequence pairs by using a deep residual recurrent convolutional neural network. Experimental results demonstrate that the framework scales well across different datasets. In the same year, Beltran et al. [24] used five feature extraction methods, namely, dipeptide composition, tripeptide composition, autocovariance, amino acid composition, and pseudo-amino-acid composition, to represent amino acid sequences. They then employed SVM, random forest (RF), and extreme gradient boosting (XGBoost) separately to predict PPIs and achieved good prediction performance. More recently, Jha and Saha [25] presented a deep-learning-based predictor to identify PPIs. They introduced two deep learning algorithms, ResNet50 and a stacked autoencoder, to extract features from the autocovariance and conjoint triad representations of protein sequences. An LSTM-based classifier was then constructed for each feature encoding scheme. The experimental results show that this deep learning scheme can learn valuable features from multimodal information of proteins.
Although a number of computational-based methods have achieved good progress and application prospects, the accuracy and efficiency of PPIs prediction still need to be further enhanced so as to provide a supplementary tool for proteomics research and other bioinformatics tasks.
In this paper, we present an efficient computational method for detecting PPIs from amino acid sequences that combines an evolutionary matrix representation of protein sequences with an ensemble classifier. An important improvement of the proposed model is a more accurate numerical representation of protein sequences. Specifically, we apply the MatFLDA feature extraction algorithm to a position-specific scoring matrix (PSSM) to extract the evolutionary information of protein sequences and use a random ferns (RFs) classifier to predict PPIs. Each protein sequence is first represented as a PSSM. To obtain more representative information, the MatFLDA descriptor then extracts a 400-dimensional feature vector from each PSSM, so that each protein pair is represented by an 800-dimensional feature vector. Finally, the feature vectors of the protein pairs are fed to the RFs ensemble model to classify PPIs. The proposed method is evaluated on the Yeast and H. pylori PPI datasets, with prediction accuracies of 95.03% and 85.35%, respectively. Comparison with a series of previous computational methods shows that the proposed model generalizes well in predicting PPIs.

Materials and Methodology
2.1. Datasets. A number of PPI databases have been created so far, including the HAPPI database [26], the Molecular Interaction Database (MINT) [27], the APID database [28], the Biomolecular Interaction Network Database (BIND) [29], and the Database of Interacting Proteins (DIP) [30]. We use two high-quality benchmark datasets extracted from DIP to test the generality of the model and assess the performance of the proposed method. The first is the yeast dataset collected by Guo et al. [17]. To avoid the bias introduced by homologous sequence pairs, a preprocessing procedure removed protein pairs with greater than 40% sequence identity and those with fewer than 50 residues. This process yielded 5594 protein pairs, which formed the golden standard positive dataset. A further 5594 protein pairs, obtained by removing pairs whose proteins share the same subcellular localization, formed the golden standard negative dataset. The second is the H. pylori dataset, which was validated by the yeast two-hybrid technology [31] and collected by Martin et al. [32]. It contains 1458 positive protein pairs and 1458 negative protein pairs, which are regarded as the positive and negative datasets, respectively. Consequently, the yeast and H. pylori datasets comprise a total of 11,188 and 2916 protein pairs, respectively.

Numerical Characterization of Protein Sequences.
The Position-Specific Scoring Matrix (PSSM) is a scoring matrix that captures the evolutionary information of protein sequences, which is crucial in proteomics. The PSSM was originally introduced by Gribskov et al. [33] in 1987 and is commonly used to detect distantly related proteins and protein folding patterns [34]. PSSM encodings have since been used in many fields of bioinformatics, such as the identification of DNA binding proteins [35], the identification of drug-target interactions [36], the prediction of membrane protein types [37], and protein-protein interaction site prediction [38]. In this experiment, we employed the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) [39] to convert each protein sequence into a PSSM, a widely adopted numerical representation of protein sequences for PPI detection tasks. A PSSM is a matrix of T rows and 20 columns, where T is the length of the protein sequence and the 20 columns correspond to the 20 native amino acids. Writing $M = \{\partial_{i,j} : i = 1, \dots, T;\ j = 1, \dots, 20\}$, the PSSM can be described as follows:

$$
M = \begin{bmatrix}
\partial_{1,1} & \partial_{1,2} & \cdots & \partial_{1,20} \\
\partial_{2,1} & \partial_{2,2} & \cdots & \partial_{2,20} \\
\vdots & \vdots & \ddots & \vdots \\
\partial_{T,1} & \partial_{T,2} & \cdots & \partial_{T,20}
\end{bmatrix}
\qquad (1)
$$

The elements of this matrix are positive or negative integers, where the element $\partial_{i,j}$ scores how likely the amino acid at the $i$th position of the sequence is to mutate into the $j$th amino acid in the process of biological evolution. A positive score means that the substitution occurs more frequently in the alignment, whereas a negative score means that the substitution occurs less frequently.
In our study, we set the e-value and the number of iterations of PSI-BLAST to 0.001 and 3, respectively, to obtain highly and broadly homologous protein sequences. Consequently, each protein sequence is represented as a matrix of T × 20 elements, where T is the length of the given protein sequence and 20 is the number of amino acid types. PSI-BLAST is available at http://blast.ncbi.nlm.nih.gov/Blast.cgi [40,41].
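As an illustration of how a PSSM can be turned into the T × 20 integer matrix described above, the following is a minimal sketch, not the authors' pipeline. It assumes the PSSM was exported as plain text in the ASCII layout written by the BLAST+ `psiblast` program (option `-out_ascii_pssm`), in which each data row starts with a position index and a one-letter residue followed by the 20 scores. The function name `parse_ascii_pssm` and the two-residue sample are hypothetical, and the sample keeps only the score columns.

```python
def parse_ascii_pssm(text):
    """Return (sequence, rows); rows is a T x 20 list of integer scores."""
    residues, rows = [], []
    for line in text.splitlines():
        parts = line.split()
        # Data rows: position index, one-letter residue, then the 20 scores.
        if len(parts) >= 22 and parts[0].isdigit() and len(parts[1]) == 1:
            residues.append(parts[1])
            rows.append([int(v) for v in parts[2:22]])
    return "".join(residues), rows


# Hypothetical two-residue sample in the assumed layout (score columns only).
SAMPLE = """
Last position-specific scoring matrix computed
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 M    -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
    2 K    -1   2   0  -1  -3   1   1  -2  -1  -3  -3   5  -2  -3  -1   0  -1  -3  -2  -3
"""

sequence, pssm = parse_ascii_pssm(SAMPLE)
print(sequence, len(pssm), len(pssm[0]))
```

Header and footer lines are skipped because they do not begin with an integer position, so only the T rows of scores are kept.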

Matrix Fisher Linear Discriminant Analysis (MatFLDA).
Fisher linear discriminant analysis (FLDA) is a popular feature extraction method [42] that has recently gained considerable attention in data mining and pattern recognition, for example in software fault prediction [43], Arabic text classification [44], and face recognition [45]. As described in Section 2.2, each PSSM can be denoted as $M = \{\partial_{i,j} : i = 1, \dots, T;\ j = 1, \dots, 20\}$, which is a $T \times 20$ matrix. To construct the FLDA of the matrix pattern, let $A_{ij}$, $j = 1, \dots, N_i$, denote the matrix patterns of the $i$th class, which contains $N_i$ samples, so that

$$
N = \sum_{i=1}^{C} N_i,
$$

where $N$ represents the number of PSSMs, and the total sample mean is defined as

$$
\bar{A} = \frac{1}{N} \sum_{i=1}^{C} \sum_{j=1}^{N_i} A_{ij}.
$$

For Matrix Fisher Linear Discriminant Analysis (MatFLDA), assume that matrix patterns belonging to $C$ classes are given, where $C = 20$ represents the 20 classes of amino acids, and that their class means are

$$
\bar{A}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} A_{ij}, \qquad i = 1, 2, \dots, C.
$$

Let $x$ be a vector with $m$ components. MatFLDA projects an $m \times n$ matrix pattern $A$ onto $x$, subject to the constraint $x^{T} x = 1$, and a $1 \times n$ feature matrix is generated by the following linear transformation:

$$
y = x^{T} A,
$$

where $y$ is the extracted feature matrix (the projected value). Hence, every matrix pattern $A_{ij}$ is projected as follows:

$$
y_{ij} = x^{T} A_{ij}, \qquad i = 1, \dots, C;\ j = 1, \dots, N_i.
$$

To find the optimal projection vector $x$, we maximize the following criterion function:

$$
J^{\mathrm{Mat}}(x) = \frac{x^{T} S_b^{\mathrm{Mat}} x}{x^{T} S_w^{\mathrm{Mat}} x},
$$

where $S_b^{\mathrm{Mat}}$ is the total between-class scatter matrix, defined as

$$
S_b^{\mathrm{Mat}} = \sum_{i=1}^{C} N_i \left(\bar{A}_i - \bar{A}\right)\left(\bar{A}_i - \bar{A}\right)^{T},
$$

and $S_w^{\mathrm{Mat}}$ is the total within-class scatter matrix, defined as

$$
S_w^{\mathrm{Mat}} = \sum_{i=1}^{C} \sum_{j=1}^{N_i} \left(A_{ij} - \bar{A}_i\right)\left(A_{ij} - \bar{A}_i\right)^{T}.
$$

By maximizing $J^{\mathrm{Mat}}(x)$, the MatFLDA algorithm keeps the between-class scatter as large as possible and the within-class scatter as small as possible in the projection space. Under the constraint $x^{T} x = 1$, this optimization problem is equivalent to solving the following eigenvalue-eigenvector matrix equation:

$$
S_b^{\mathrm{Mat}} x = \lambda\, S_w^{\mathrm{Mat}} x.
$$

The new features are then obtained from the appropriate $x$ and used in the subsequent classification task. In this experiment, the PSSMs of the $N$ protein sequences, each of size $T \times 20$, were used as input to the MatFLDA algorithm on the yeast and H. pylori datasets, where MatFLDA was used only for feature extraction. Applying MatFLDA to the original PSSM of a protein sequence produced a 20 × 20 feature matrix, that is, a feature vector of 1 × 400 dimensions for each PSSM. Consequently, the output for $N$ PSSMs is $N$ feature matrices of fixed size 20 × 20, and each protein pair contains 800 features. To illustrate how the MatFLDA algorithm is used for feature extraction of protein sequences, Figure 1 gives a schematic diagram of MatFLDA feature extraction for one protein pair from the Saccharomyces cerevisiae dataset, namely Histone H4 and Regulatory protein SIR3.
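The step from maximizing the criterion to an eigenvalue problem can be made explicit. The derivation below is a sketch under two assumptions that the text does not state: the criterion has the usual Fisher ratio form built from the between-class and within-class scatter matrices $S_b^{\mathrm{Mat}}$ and $S_w^{\mathrm{Mat}}$, and $S_w^{\mathrm{Mat}}$ is nonsingular. The symbol $\lambda$ is notation introduced here.

```latex
\begin{aligned}
J^{\mathrm{Mat}}(x) &= \frac{x^{T} S_b^{\mathrm{Mat}} x}{x^{T} S_w^{\mathrm{Mat}} x}, \\[4pt]
\nabla_x J^{\mathrm{Mat}}(x)
  &= \frac{2\, S_b^{\mathrm{Mat}} x \,\bigl(x^{T} S_w^{\mathrm{Mat}} x\bigr)
         - 2\, S_w^{\mathrm{Mat}} x \,\bigl(x^{T} S_b^{\mathrm{Mat}} x\bigr)}
          {\bigl(x^{T} S_w^{\mathrm{Mat}} x\bigr)^{2}} = 0 \\[4pt]
  &\Longrightarrow\; S_b^{\mathrm{Mat}} x
     = \frac{x^{T} S_b^{\mathrm{Mat}} x}{x^{T} S_w^{\mathrm{Mat}} x}\; S_w^{\mathrm{Mat}} x
     = \lambda\, S_w^{\mathrm{Mat}} x,
  \qquad \lambda = J^{\mathrm{Mat}}(x).
\end{aligned}
```

Because the ratio does not change when $x$ is rescaled, the constraint $x^{T} x = 1$ only fixes the length of $x$. Every stationary point is a generalized eigenvector whose eigenvalue equals the criterion value, so the maximizer is the eigenvector with the largest $\lambda$.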

Random Ferns (RFs).
The random ferns classifier was developed from random forests but differs from them [46,47]. Given the PSSM of a protein sequence, the task is to assign it to the most likely class. Let $c_i$, $i = 1, 2$, be the set of classes, where 1 indicates an interacting protein and 2 a noninteracting protein. Let $x_j$, $j = 1, 2, \dots, N$, be the set of normalized features of the 20 × 20 feature matrix computed by the MatFLDA algorithm from the PSSM to be classified. Formally, we are looking for [48]

$$
c_i' = \arg\max_{c_i} P(C = c_i \mid x_1, x_2, \dots, x_N),
$$

where $C$, a random variable, represents the class of the protein.
The aim is to model the posterior probability of the interacting protein class given a set of $N$ features. By the Bayes formula, this can be expressed as

$$
P(C = c_i \mid x_1, x_2, \dots, x_N) = \frac{P(x_1, x_2, \dots, x_N \mid C = c_i)\, P(C = c_i)}{P(x_1, x_2, \dots, x_N)}.
$$

We assume a uniform prior $P(C)$, and the denominator $P(x_1, x_2, \dots, x_N)$ is just a scale factor that is independent of the class and common to all classes. Dropping both, the problem reduces to finding

$$
c_i' = \arg\max_{c_i} P(x_1, x_2, \dots, x_N \mid C = c_i).
$$

Learning the complete joint probability of all features is intractable. Naive Bayes assumes that all features are completely independent, that is,

$$
P(x_1, x_2, \dots, x_N \mid C = c_i) = \prod_{j=1}^{N} P(x_j \mid C = c_i).
$$

However, this independence assumption is usually wrong because it completely ignores the correlation between features in practice. To account for the dependencies between features while keeping the problem tractable, a better compromise is to divide the features into $M$ groups of size $S = N/M$. These groups are what we define as ferns, and we calculate the joint probability of the features in each fern. The conditional probability is expressed as follows:

$$
P(x_1, x_2, \dots, x_N \mid C = c_i) = \prod_{m=1}^{M} P(F_m \mid C = c_i),
$$

where $F_m = \{x_{\vartheta(m,1)}, x_{\vartheta(m,2)}, \dots, x_{\vartheta(m,S)}\}$, $m = 1, \dots, M$, refers to the $m$th fern, and $\vartheta(m, j)$ is a random permutation function. We therefore follow a seminaive Bayesian method that models only some of the dependencies between features. The class conditional probabilities $P(F_m \mid C = c_i)$ are estimated for each fern $F_m$ and class $c_i$ in the training phase. For each fern $F_m$, these terms can be described as

$$
p_{k,c_i} = P(F_m = k \mid C = c_i) = \frac{N_{k,c_i}}{N_{c_i}},
$$

where $N_{k,c_i}$ represents the number of training samples of class $c_i$ that evaluate to fern value $k$, $k = 1, 2, \dots, K$, with $K = 2^{S}$, and $N_{c_i}$ represents the total number of samples of class $c_i$. However, when the number of training samples is not large enough, $N_{k,c_i}$, and hence $p_{k,c_i}$, can be zero for some fern values. To overcome this problem, $p_{k,c_i}$ is rewritten as

$$
p_{k,c_i} = \frac{N_{k,c_i} + N_r}{N_{c_i} + K \times N_r},
$$

where $N_r$ is a regularization term that behaves as a uniform Dirichlet prior over feature values. $N_r = 1$ is used to guarantee that the estimates are above zero.
In this experiment, the two key parameters of the random ferns classifier were set as follows: S (the depth of the ferns) was set to 20 and M (the number of ferns) was set to 140. Finally, the features extracted by the MatFLDA algorithm are normalized and fed into the random ferns classifier to predict whether the two proteins in each pair interact.
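The seminaive Bayes scheme described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation. The class name `RandomFerns`, the way binary tests are built (a randomly chosen feature compared against a threshold copied from a random training sample), and the toy data are all assumptions, because the text does not specify how each fern turns real-valued features into a fern value. Small values of S and M are used here, and the counts are stored sparsely because a fern of depth S has 2^S possible values.

```python
import math
import random
from collections import defaultdict


class RandomFerns:
    """Seminaive Bayes over M ferns, each built from S binary threshold tests."""

    def __init__(self, n_ferns=10, depth=4, n_reg=1.0, seed=0):
        self.n_ferns, self.depth, self.n_reg = n_ferns, depth, n_reg
        self.rng = random.Random(seed)

    def _fern_value(self, tests, x):
        # Pack the S binary outcomes into one integer k in [0, 2**S).
        k = 0
        for feature, threshold in tests:
            k = (k << 1) | int(x[feature] > threshold)
        return k

    def _prob(self, table, k, c):
        # Regularized estimate p_{k,c} = (N_{k,c} + N_r) / (N_c + K * N_r).
        n_values = 2 ** self.depth
        return (table.get((k, c), 0) + self.n_reg) / (
            self.class_counts[c] + n_values * self.n_reg)

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = {c: list(y).count(c) for c in self.classes}
        self.ferns, self.tables = [], []
        for _ in range(self.n_ferns):
            # Assumed test design: random feature, threshold copied from a
            # randomly chosen training sample.
            tests = []
            for _ in range(self.depth):
                f = self.rng.randrange(len(X[0]))
                tests.append((f, self.rng.choice(X)[f]))
            table = defaultdict(int)  # sparse counts N_{k,c}
            for x, label in zip(X, y):
                table[(self._fern_value(tests, x), label)] += 1
            self.ferns.append(tests)
            self.tables.append(table)
        return self

    def log_scores(self, x):
        # Sum of log P(F_m | C = c) over all ferns, one score per class.
        return {c: sum(math.log(self._prob(t, self._fern_value(f, x), c))
                       for f, t in zip(self.ferns, self.tables))
                for c in self.classes}

    def predict(self, x):
        scores = self.log_scores(x)
        return max(scores, key=scores.get)


# Toy data (hypothetical): class 1 = interacting, class 2 = noninteracting.
data_rng = random.Random(1)
X = [[data_rng.uniform(1, 2) for _ in range(4)] for _ in range(40)] + \
    [[data_rng.uniform(-2, -1) for _ in range(4)] for _ in range(40)]
y = [1] * 40 + [2] * 40
model = RandomFerns(n_ferns=10, depth=4).fit(X, y)
print(model.predict([1.5] * 4), model.predict([-1.5] * 4))
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow when many ferns are combined. With N_r = 1, no fern value ever receives probability zero, and the estimates for each fern and class still sum to one over the K = 2^S fern values.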

Results and Discussion
3.1. Evaluation Criteria. To ensure the robustness of the proposed model and to avoid overfitting and data dependency, we adopted five-fold cross-validation to assess the effectiveness of the method in predicting PPIs. Specifically, we divide the experimental dataset into five parts, select four of them as the training dataset, and use the remaining one as the testing dataset. The average values over the five independent experiments are used as the prediction results. The following measures are used: overall prediction accuracy (ACC), precision (PE), sensitivity (SN), and Matthews correlation coefficient (MCC), which are defined as follows:

$$
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},
$$

$$
\mathrm{PE} = \frac{TP}{TP + FP},
$$

$$
\mathrm{SN} = \frac{TP}{TP + FN},
$$

$$
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},
$$

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. TP is the number of true interacting pairs that are predicted correctly, and TN is the number of true noninteracting pairs that are predicted correctly. FP is the number of true noninteracting pairs that are predicted to interact, and FN is the number of true interacting pairs that are predicted not to interact. MCC is a balanced indicator of the quality of binary classification; it ranges between -1 and +1 and represents the correlation coefficient between the observed and the predicted results. In this experiment, the receiver operating characteristic (ROC) curve [49] and the area under the ROC curve (AUC) [50] are also employed to evaluate the prediction performance of the proposed model. The larger the AUC value of the classifier, the better the prediction performance of the method and the more stable the constructed model. The flow of the proposed scheme is shown in Figure 2.

Table 2 shows that the standard deviations corresponding to these four evaluation values are 0.64%, 0.81%, 0.92%, and 1.14%, respectively.
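The four measures named in this section (ACC, PE, SN, and MCC) can be computed directly from the confusion counts. The sketch below uses their standard confusion-matrix definitions. The function name `ppi_metrics` and the counts in the example are hypothetical and are not results from this study.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Return ACC, PE, SN and MCC from the four confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pe = tp / (tp + fp)   # precision
    sn = tp / (tp + fn)   # sensitivity (recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "PE": pe, "SN": sn, "MCC": mcc}


# Hypothetical counts for one test fold.
fold = ppi_metrics(tp=90, tn=95, fp=5, fn=10)
print({name: round(value, 4) for name, value in fold.items()})
```

The sketch does not guard against empty denominators, which occur when a fold contains no predicted positives or no true positives. Under five-fold cross-validation, the reported value of each measure is the mean over the five folds.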
To better visualize the performance of combining RFs and MatFLDA to predict PPIs, we plot the ROC curves on the two benchmark datasets. MCC and AUC values were also calculated to better quantify the predictive performance of the proposed model. The ROC curves obtained on the two benchmark datasets are shown in Figures 3 and 4. From Figures 3 and 4, we can see that the average AUC values obtained by the proposed method were 94.27% and 94.12% on the Yeast and H. pylori datasets, respectively. These promising results show that the proposed method is feasible, effective, and practical for detecting PPIs. The prediction performance depends mainly on the choice of the feature extraction algorithm and the classification model. The results suggest that the MatFLDA feature extraction descriptor effectively retains useful information from the original protein sequences. Moreover, the high prediction accuracies and low standard deviations further indicate that the proposed method is robust for predicting PPIs.
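For readers who want to reproduce an AUC value from classifier scores, the following sketch uses the rank interpretation of the AUC: the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen noninteracting pair, with ties counted as one half. The function name `roc_auc`, the label coding (1 for interacting, 0 for noninteracting), and the example scores are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 8 of the 9 positive-negative pairs are ordered correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

This pairwise form takes time proportional to the number of positive-negative pairs, which is acceptable for a single test fold. A sort-based rank computation gives the same value more efficiently on larger sets.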

Comparison of the Four Methods Using the Same Feature Representation.
When machine-learning-based methods are used to predict PPIs, the same feature extraction approach generally yields different prediction results when combined with different classifiers. In this section, to further evaluate the prediction performance of the proposed model, we performed PPI experiments with the same feature extraction method on the state-of-the-art individual classifier support vector machine (SVM) and on the proposed ensemble learning classifier random ferns. The LIBSVM toolbox, downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/ [51], was employed to carry out the SVM classification. A polynomial function is used as the kernel function, and the initial SVM parameters are c = 0.1, g = 0.2 on the Yeast dataset and c = 0.01, g = 0.1 on the H. pylori dataset when predicting PPIs using five-fold cross-validation. For the SVM and RFs classifiers, all input feature vectors are normalized by the zero-mean normalization method.
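The normalization step can be sketched as follows. The sketch interprets zero-mean normalization as z-score standardization (subtract each feature's mean and divide by its standard deviation), which is an assumption, since the text does not give the formula. The helper names `fit_zscore` and `apply_zscore` and the small example are hypothetical.

```python
import math


def fit_zscore(rows):
    """Estimate per-feature mean and standard deviation from training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def apply_zscore(rows, means, stds):
    """Standardize rows using statistics estimated elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical training fold with two features; the second one is constant.
train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
means, stds = fit_zscore(train)
train_scaled = apply_zscore(train, means, stds)
test_scaled = apply_zscore([[3.0, 12.0]], means, stds)
print(train_scaled, test_scaled)
```

Estimating the means and standard deviations on the training fold only and reusing them for the test fold keeps information from the test data out of the model during cross-validation.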
The experimental results of the RFs-based and SVM-based classifiers on the Yeast and H. pylori datasets are presented in Tables 3 and 4, respectively. From Table 3, the average accuracy, precision, sensitivity, and MCC of the RFs method on the Yeast dataset are as high as 95.03%, 99.14%, 90.84%, and 90.52%, respectively. The SVM classifier, however, yielded relatively poor prediction results, with average accuracy, precision, sensitivity, and MCC of 80.39%, 83.01%, 76.44%, and 68.38%, respectively. The maximum accuracy obtained by the SVM classifier is 81.63%, which is 13% lower than the minimum accuracy obtained by the RFs method. Similarly, as presented in Table 4, the average accuracy of the SVM method on the H. pylori dataset is 82.09%, with the five models achieving 82.85%, 82.33%, 79.42%, 82.33%, and 83.53%, respectively. For further evaluation, the ROC (receiver operating characteristic) curves and AUC values of the SVM method were also calculated (see Figures 5 and 6). The average AUC values obtained by the SVM with the same feature extraction method on the Yeast and H. pylori datasets were 85.78% and 88.94%, respectively. In addition, we evaluated the prediction performance using Random Forest and XGBoost classifiers with the same features. Comparing the proposed model with these three models, we can clearly see that the proposed model achieves good performance in the prediction of PPIs. Thus, the proposed model can provide a useful tool for detecting PPIs and for other bioinformatics tasks.

Comparison with Other PPI Prediction Methods.
Many computational methods based on data mining have been presented for sequence-based PPI prediction. In this section, to verify the performance of the proposed model, we compare it with several other state-of-the-art methods on the Yeast and H. pylori datasets. Specifically, we compared the proposed method with previous work on PPI prediction, beginning with that of Guo et al. Table 5 lists the PPI prediction results of these methods on the same Yeast dataset.
As shown in Table 5, the accuracy, sensitivity, precision, and MCC of the MatFLDA_RFs method are 95.03%, 90.84%, 99.14%, and 90.52%, respectively. Compared with the other existing methods listed, the accuracy of the proposed method is 0.60% to 8.88% higher. The ACC of the MatFLDA_RFs method is 7.67% higher than the AC method, 8.88% higher than the Cod4 + KNN method, 6.47% higher than the SVM + LD method, 3.67% higher than the MCD + SVM method, 0.89% higher than the LRA + RF method, 0.60% higher than the DeepPPI method, and 1.11% higher than the PR-LPQ + RF method. The PE of the MatFLDA_RFs method is 11.32% higher than the AC method, 8.90% higher than the Cod4 + KNN method, 9.64% higher than the SVM + LD method, 7.20% higher than the MCD + SVM method, 2.04% higher than the LRA + RF method, 2.49% higher than the DeepPPI method, and 2.69% higher than the PR-LPQ + RF method. The MCC of the MatFLDA_RFs method is 13.37% higher than the SVM + LD method, 6.31% higher than the MCD + SVM method, 1.56% higher than the LRA + RF method, 1.55% higher than the DeepPPI method, and 1.96% higher than the PR-LPQ + RF method. Similarly, Table 6 presents the PPI prediction results of other existing methods on the same H. pylori dataset. As shown in Table 6, the prediction performance of the proposed method is better than that of the other existing methods. The obtained values of ACC, SN, PE, and MCC are 85.35%, 95.72%, 79.27%, and 74.41%, respectively. In terms of ACC, the MatFLDA_RFs method is 0.44% to 9.55% higher than the other methods: 1.95% higher than the Signature Products + SVM method, 0.44% higher than the MCD + SVM method, 1.65% higher than the WSR method, 9.55% higher than the Phylogenetic Bootstrap method, 2.35% higher than the LDC method, and 5.83% higher than the Boosting method. These results indicate that the proposed method is an effective computational tool for predicting PPIs.

Conclusion
The study of proteins and their interactions is essential to understanding most biological activities in living cells, such as development, signal transduction, and apoptosis. The successful prediction of PPIs will therefore facilitate the study of other related problems in biomedical science. In this work, we present a novel computational approach to detect PPIs that uses the MatFLDA algorithm, the random ferns (RFs) classifier, and the PSSM, which preserves protein evolutionary information. More specifically, MatFLDA is used to obtain the feature representation from the PSSM, an evolutionary matrix of protein sequences that contains a great deal of valuable knowledge for PPI prediction. The RFs classifier is then applied to detect novel PPIs. To measure the PPI identification ability of the developed method, we conducted computational experiments on the two benchmark PPI datasets, Yeast and H. pylori. The experimental results indicate that the proposed MatFLDA_RFs method has a higher identification rate of PPIs than the other existing methods considered and than SVM-based approaches. Consequently, the proposed method for identifying PPIs is reliable and effective and can serve as a practical complement to experimental methods, facilitating further research on related problems in the field of bioinformatics.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflict of interest.